Analysis of variance

The example data in this article is generated each time the page loads. To see an example with different values, reload the page.

ANOVA

ANOVA is an acronym for ANalysis Of VAriance. In statistics it is a powerful tool for determining whether groups of observations differ from one another, i.e. whether a factor influences the result. The analysis of variance was introduced by Ronald Fisher, an English scientist who made a huge contribution to the development of statistics.

Example

Suppose you want to study gasoline quality empirically. You fill up the tank at one gas station and drive n kilometers, repeat the experiment, say, five times, and then run the same experiment with fuel from a different gas station. You now have two sets of data: gas station A and gas station B. The figures are certainly scattered, but there may still be some dependence. To determine whether the choice of gas station affects gasoline consumption (or the data are unrelated), you use analysis of variance.

The analysis of variance allows you to determine which source of variation is larger: intra-group (the scatter inside each group) or intergroup (the difference between the groups). In the example above, you will be able to determine how much the choice of gas station affects gasoline consumption. This is the essence of the analysis of variance: to find out whether the selected factor is significant for the selected observations.

In a sense, the analysis of variance is similar to regression and correlation analyses, because it allows you to determine the influence of variables on each other.

Analysis

In theory, a simple model is built to analyze the variance, similar to the one studied in time series analysis.

Model

The model of the analysis of variance includes the average value, the effect of the experiment and a random error:

y_ij = μ + τ_i + ε_ij
y_ij - observation j in group i, μ - overall average, τ_i - effect of experiment i, ε_ij - random error
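
To make the model concrete, here is a minimal sketch that generates data from it. The function name simulate_groups, the parameter values and the normally distributed error are assumptions made for illustration; they are not taken from the page's own generator.

```python
import random

def simulate_groups(mu, effects, n, sigma, seed=0):
    """Generate n observations per group from y = mu + tau + eps.

    effects holds one tau per group; eps is drawn from a normal
    distribution with mean 0 and standard deviation sigma (an assumption).
    """
    rng = random.Random(seed)
    return [[mu + tau + rng.gauss(0.0, sigma) for _ in range(n)]
            for tau in effects]

# Four hypothetical groups; only the third one has a non-zero effect.
groups = simulate_groups(mu=50.0, effects=[0.0, 0.0, 60.0, 0.0], n=5, sigma=10.0)
for i, g in enumerate(groups, start=1):
    print(f"E{i}: mean = {sum(g) / len(g):.1f}")
```

With sigma set to 0 every observation equals μ + τ exactly, which is a quick way to check that the generator follows the model.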

Single-factor

One-factor analysis of variance considers the influence of a single criterion. It is done this way: we conduct several series of experiments that differ only in the value of the factor and analyze whether the factor has changed the results. As initial data, consider the results of four experiments with five measurements each:

N	E₁	E₂	E₃	E₄
1	60	59	139	40
2	37	51	81	45
3	35	33	136	51
4	52	59	70	31
5	47	36	124	58
μ_i	46.2	47.6	110	45

μ = (46.2 + 47.6 + 110 + 45) / 4 = 62.2
Here a = 4 is the number of groups, n = 5 is the number of measurements in each group, and N = a·n = 20 is the total number of measurements.
The sum of squares within groups (SS_w):
SS_w = Σ_iΣ_j(y_ij - μ_i)² = 5634
The sum of squares between groups (SS_b), where each group mean is weighted by the group size n:
SS_b = n·Σ_i(μ_i - μ)² = 5 · 3049.84 = 15249.2
Dividing by the degrees of freedom gives the mean squares:
MS_w = SS_w / (a(n-1)) = 5634 / 16 = 352.13
MS_b = SS_b / (a-1) = 15249.2 / 3 = 5083.07
The observed value of F:
F₀ = MS_b/MS_w = 14.44

Fisher's test: if the value of F₀ turns out to be greater than the critical value F_{α,a-1,N-a}, then the factor has an impact.

For N = 20, a = 4 and α = 0.05, F_{α,a-1,N-a} = F_{0.05,3,16} = 3.24
Since F₀ = 14.44 > 3.24, we conclude that the introduced factor did have an effect on the results of the experiment.
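
The single-factor calculation can be checked with a short script. This is a sketch in the standard textbook form: the between-group sum of squares is weighted by the group size n, and the degrees of freedom are a-1 and a(n-1). The function name one_way_anova is hypothetical, equal group sizes are assumed, and the critical value 3.24 is copied from an F table for α = 0.05 with 3 and 16 degrees of freedom rather than computed.

```python
def one_way_anova(groups):
    """Return (SS_w, SS_b, MS_w, MS_b, F0) for equally sized groups."""
    a = len(groups)        # number of groups
    n = len(groups[0])     # measurements per group (assumed equal)
    means = [sum(g) / n for g in groups]
    grand = sum(means) / a
    # Scatter of each measurement around its own group mean.
    ss_w = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    # Scatter of group means around the grand mean, weighted by n.
    ss_b = n * sum((m - grand) ** 2 for m in means)
    ms_w = ss_w / (a * (n - 1))
    ms_b = ss_b / (a - 1)
    return ss_w, ss_b, ms_w, ms_b, ms_b / ms_w

# Columns E1..E4 of the table above.
experiments = [
    [60, 37, 35, 52, 47],
    [59, 51, 33, 59, 36],
    [139, 81, 136, 70, 124],
    [40, 45, 51, 31, 58],
]

ss_w, ss_b, ms_w, ms_b, f0 = one_way_anova(experiments)
F_CRIT = 3.24  # assumed table value of F(0.05; 3, 16)
print(f"SS_w = {ss_w:.2f}, SS_b = {ss_b:.2f}, F0 = {f0:.2f}")
print("factor has an effect" if f0 > F_CRIT else "no effect detected")
```

Keeping the critical value as a constant avoids any dependency outside the standard library; a statistics package could supply it instead.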

Two-factor

In two-factor analysis, three hypotheses are put forward for verification:

The interaction of factors A and B does not affect the result
Factor A does not affect the result
Factor B does not affect the result

To carry out a two-factor analysis, the results must be arranged in groups: several measurements for every combination of the values of the two factors, i.e.:

	A₁	A₂
B₁	X1_a1,b1...XN_a1,b1	X1_a2,b1...XN_a2,b1
B₂	X1_a1,b2...XN_a1,b2	X1_a2,b2...XN_a2,b2

Next, the total (and from it the average) for each value of each factor is calculated, i.e. for A₁, for B₁, etc. Then the grand total for all results is calculated. Let's set the number of levels of each criterion: k = 2 (the number of values of criterion A) and m = 2 (the number of values of criterion B).

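In the formulas below, i is the level of factor A, j is the level of factor B, l is the number of the measurement inside a cell, n is the number of measurements per cell and N = n·k·m is the total number of measurements.
The grand total:
T = Σ_iΣ_jΣ_l x_ijl
The sum of elements under the influence of factor A:
T_Ai = Σ_jΣ_l x_ijl
The sum of elements under the influence of factor B:
T_Bj = Σ_iΣ_l x_ijl
The sum of elements in the cell for the combination of A_i and B_j:
T_AiBj = Σ_l x_ijl
SST = Σ_iΣ_jΣ_l x²_ijl - T²/N
SSA = Σ_i T²_Ai/(n·m) - T²/N
SSB = Σ_j T²_Bj/(n·k) - T²/N
SSAB = Σ_iΣ_j T²_AiBj/n - SSA - SSB - T²/N
SSE = Σ_iΣ_jΣ_l x²_ijl - Σ_iΣ_j T²_AiBj/n

SST = SSA + SSB + SSAB + SSE

MSE = SSE/((n-1)·m·k)
MSA = SSA/(k-1)
MSB = SSB/(m-1)
MSAB = SSAB/((m-1)·(k-1))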
Test "Criterion A does not affect the result", ν₁ = k-1:
F_A = MSA/MSE
Test "Criterion B does not affect the result", ν₁ = m-1:
F_B = MSB/MSE
Test "The interaction of criteria A and B does not affect the result", ν₁ = (k-1)(m-1):
F_AB = MSAB/MSE

For each F, if F > F_{α,ν₁,ν₂}, then the hypothesis is rejected. Here ν₂ = N - m·k.
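
The two-factor formulas can be sketched as a single function. The name two_way_anova, the nested-list layout cells[i][j] and the small 2x2 data set are all hypothetical and chosen only to illustrate the calculation; an equal number of measurements in every cell is assumed.

```python
def two_way_anova(cells):
    """cells[i][j] is the list of n measurements for levels A_i and B_j."""
    k, m, n = len(cells), len(cells[0]), len(cells[0][0])
    N = k * m * n
    T = sum(x for row in cells for cell in row for x in cell)
    sum_sq = sum(x * x for row in cells for cell in row for x in cell)
    T_A = [sum(sum(cell) for cell in row) for row in cells]
    T_B = [sum(sum(cells[i][j]) for i in range(k)) for j in range(m)]
    corr = T * T / N                                   # T^2 / N
    cell_term = sum(sum(cell) ** 2 for row in cells for cell in row) / n
    ssa = sum(t * t for t in T_A) / (n * m) - corr
    ssb = sum(t * t for t in T_B) / (n * k) - corr
    ssab = cell_term - ssa - ssb - corr
    sse = sum_sq - cell_term
    mse = sse / ((n - 1) * m * k)
    return {
        "SST": sum_sq - corr, "SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse,
        "F_A": ssa / (k - 1) / mse,
        "F_B": ssb / (m - 1) / mse,
        "F_AB": ssab / ((k - 1) * (m - 1)) / mse,
    }

# Made-up 2x2 design with n = 3 measurements per cell.
data = [
    [[10, 12, 11], [14, 15, 13]],   # A1: B1, B2
    [[20, 19, 21], [30, 29, 31]],   # A2: B1, B2
]
res = two_way_anova(data)
for name, value in res.items():
    print(f"{name} = {value:.2f}")
```

A useful sanity check on any data set is that SSA + SSB + SSAB + SSE reproduces SST.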

Multifactorial

Multi-factor analysis is similar to two-factor analysis: the same operations are performed, but the criteria are grouped and the influence of each of the factors is found iteratively.

With repeated measurements

The analysis of variance with repeated measurements means that several measurements of the random variable were made for each value of the criterion in order to obtain a more accurate result, since ANOVA uses the intra-group sum of squares.

Application

Analysis of variance is used in a wide variety of branches of science and production when it is necessary to find out whether a criterion changes the average value of a result. The averages are not compared directly: instead, the spread between the group averages is compared with the spread of the results around their own mean, i.e. with the variance.

Solving problems

As an example, let's take a problem from metrology. A plant has five machines that produce shafts. It is necessary to determine whether the choice of machine tool or the training of the employee affects the result of production. For the analysis, ten measurements are made for each combination of machine and employee, and the result is a table:

Operator 1
M1	30.224	30.242	30.201	30.287	30.231	30.263	30.26	30.244	30.268	30.231
M2	30.337	30.376	30.366	30.398	30.311	30.328	30.381	30.34	30.352	30.398
M3	30.542	30.448	30.7	30.542	30.609	30.457	30.638	30.547	30.626	30.532
M4	30.388	30.343	30.377	30.333	30.313	30.36	30.33	30.381	30.386	30.337
M5	30.381	30.36	30.398	30.386	30.353	30.306	30.315	30.4	30.37	30.335
Operator 2
M1	31.099	31.3	30.546	31.158	31.226	30.428	30.61	31.208	30.93	31.128
M2	30.11	30.15	30.298	30.142	30.246	30.169	30.273	30.279	30.276	30.203
M3	30.34	30.377	30.333	30.354	30.318	30.355	30.385	30.385	30.385	30.341
M4	30.279	30.142	30.229	30.25	30.29	30.131	30.26	30.293	30.134	30.179
M5	30.357	30.329	30.392	30.383	30.333	30.336	30.367	30.37	30.309	30.345

Let's use the method of two-factor analysis: factor A is the operator (k = 2) and factor B is the machine (m = 5), with n = 10 measurements per combination and N = 100 in total. To calculate the sums of squares, first calculate the total for each of the groups:

T	T_A1	T_A2	T_B1	T_B2	T_B3	T_B4	T_B5
3039.891	1518.831	1521.06	612.084	605.733	609.214	605.735	607.125
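
The totals in this table, and the sums of squares SSA and SSB that follow from them, can be reproduced from the raw measurements. The sketch below assumes the two operator tables are typed in as nested lists; the variable names are hypothetical.

```python
# data[operator][machine] -> the ten measurements from the tables above.
data = [
    [  # Operator 1, machines M1..M5
        [30.224, 30.242, 30.201, 30.287, 30.231, 30.263, 30.26, 30.244, 30.268, 30.231],
        [30.337, 30.376, 30.366, 30.398, 30.311, 30.328, 30.381, 30.34, 30.352, 30.398],
        [30.542, 30.448, 30.7, 30.542, 30.609, 30.457, 30.638, 30.547, 30.626, 30.532],
        [30.388, 30.343, 30.377, 30.333, 30.313, 30.36, 30.33, 30.381, 30.386, 30.337],
        [30.381, 30.36, 30.398, 30.386, 30.353, 30.306, 30.315, 30.4, 30.37, 30.335],
    ],
    [  # Operator 2, machines M1..M5
        [31.099, 31.3, 30.546, 31.158, 31.226, 30.428, 30.61, 31.208, 30.93, 31.128],
        [30.11, 30.15, 30.298, 30.142, 30.246, 30.169, 30.273, 30.279, 30.276, 30.203],
        [30.34, 30.377, 30.333, 30.354, 30.318, 30.355, 30.385, 30.385, 30.385, 30.341],
        [30.279, 30.142, 30.229, 30.25, 30.29, 30.131, 30.26, 30.293, 30.134, 30.179],
        [30.357, 30.329, 30.392, 30.383, 30.333, 30.336, 30.367, 30.37, 30.309, 30.345],
    ],
]

k, m, n = 2, 5, 10          # operators, machines, measurements per cell
N = k * m * n

T_A = [sum(sum(cell) for cell in operator) for operator in data]
T_B = [sum(sum(data[i][j]) for i in range(k)) for j in range(m)]
T = sum(T_A)

# Sums of squares for the two main factors, computed from the totals.
ssa = sum(t * t for t in T_A) / (n * m) - T * T / N
ssb = sum(t * t for t in T_B) / (n * k) - T * T / N

print("T   =", round(T, 3))
print("T_A =", [round(t, 3) for t in T_A])
print("T_B =", [round(t, 3) for t in T_B])
print("SSA =", round(ssa, 3), " SSB =", round(ssb, 3))
```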

SSA = 0.05
SSB = 1.459
SSAB = 2.94
SSE = 1.092

MSA = 0.05
MSB = 0.365
MSAB = 0.735
MSE = SSE/((n-1)·m·k) = 1.092/90 = 0.0121

The F values are calculated from the unrounded mean squares:
F_A = 4.09
F_B = 30.06
F_AB = 60.57

Critical values for the Fisher test (α = 0.1, ν₂ = N - m·k = 90):
F_{crit A} = F_{0.1, 1, 90} = 2.77
F_{crit B} = F_{0.1, 4, 90} = 2.01
F_{crit AB} = F_{0.1, 4, 90} = 2.01

Results table:

The impact of the employee's qualifications (factor A) on the result	Yes	4.09 > 2.77
The impact of the machine (factor B) on the result	Yes	30.06 > 2.01
The mutual influence of the employee's qualifications and the choice of the machine on the result	Yes	60.57 > 2.01

In Excel/Open Calc

To carry out the analysis of variance in a spreadsheet, you will need the following functions:

sumproduct	Sum of products, used to find the sum of squares
finv	Inverse of the F distribution, used to find the critical value for the Fisher test


Author: Zakhar Telyatnikov
Last edit time: 16.03.2026

30.06.2017
